Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Abstract
Recent work has established the efficacy of Amazon’s Mechanical Turk for constructing parallel corpora for machine translation research. We apply this approach to build a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.
Similar resources
The Language Demographics of Amazon Mechanical Turk
We present a large-scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they resid...
Creating Multilingual Parallel Corpora in Indian Languages
This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase of the project has two distinct goals: creating parallel sentence aligned corp...
Cross-Domain and Cross-Language Porting of Shallow Parsing
English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...
TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domains
Large amounts of parallel corpora are required for building Statistical Machine Translation (SMT) systems. We describe the TransDoop system for gathering translations to create parallel corpora from an online crowd workforce whose members are familiar with multiple languages but are not expert translators. Our system uses a Map-Reduce-like approach to translation crowdsourcing where sentence translation i...
Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent
We present Brahmi-Net, an online system for transliteration and script conversion for all major Indian language pairs (306 pairs). The system covers 13 Indo-Aryan languages, 4 Dravidian languages and English. For training the transliteration systems, we mined parallel transliteration corpora from parallel translation corpora using an unsupervised method and trained statistical transliteration sy...